There are three primary taxonomy-specific criteria that should be met when creating an ideal set of reference sequences for evaluating metabarcoding primers. I say “taxonomy-specific” because there are obviously other considerations that deal with sequence characteristics, such as quality and the presence of primer binding sites, but these will be dealt with separately.
The three taxonomy-specific criteria are:
When evaluating potential primers for metabarcoding experiments, it is important to have reference sequences representative of every taxon of interest. The number of sequences for each taxon at each taxonomic level should be approximately equal or, perhaps more accurately, proportional to the diversity within that taxon. If the sequence of a given locus correlates with the characters used to determine taxonomy, as is the case for an ideal barcoding locus, the number of sub-taxa should be proportional to the diversity of a given taxon.
Random samples of online sequence databases typically do not have these characteristics, even when the database is designed for metabarcoding. Sequences are not available for many taxa, and the number of sequences varies dramatically among taxa. For these reasons, a simple random sample of sequences for a given taxon from an online database will often not constitute an ideal reference sequence set. Therefore, metacoder includes functions to extract a sample of reference sequences that meets these criteria from a larger set of sequences.
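To make the problem concrete, here is a minimal base-R sketch using a made-up vector of order labels (the counts are hypothetical and not taken from any database). It only illustrates that a simple random sample tends to inherit the skew of its source.

```r
# Hypothetical database: one order label per sequence, heavily skewed.
set.seed(1)
orders <- c("Agaricales", "Boletales", "Mucorales")
database <- rep(orders, times = c(900, 90, 10))

# A simple random sample of 100 sequences, drawn without any taxonomic weighting.
simple_sample <- sample(database, size = 100)

# Proportion of each order in a vector of labels.
prop_of <- function(x) vapply(orders, function(o) mean(x == o), numeric(1))

# Compare the composition of the database with that of the sample.
round(rbind(database = prop_of(database), sample = prop_of(simple_sample)), 2)
```

Because every sequence is equally likely to be drawn, the dominant order is expected to dominate the sample as well, and the rarest order may be missing entirely.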
taxonomic_sample
taxonomic_sample is used to sample observations according to their taxonomic classifications. If we look at the data below, we can see that Ascomycota and Basidiomycota comprise the majority of the sequences, even though three other fungal phyla are also present. Depending on the goal, this might or might not be a problem. The overrepresentation of Agaricales is likely a bigger problem.
library(metacoder)
heat_tree(unite_ex_data_3,
          node_size = n_obs,
          node_color = n_obs,
          node_label = name)
To reduce the Agaricales overrepresentation, we will sub-sample any order with more than 20 sequences down to 20 sequences. Similarly, species will be sub-sampled to 5 sequences each to avoid any one species introducing bias. The max_counts = c("3" = 20, "6" = 5) option is used to implement the sub-sampling limits for each taxonomic rank.
It might also be desirable not to include sequences from underrepresented taxa. With too few sequences, the diversity of the taxon cannot be determined, and questionable sequences are harder to spot without others of the same classification to compare them to. For this reason, we will use the option min_counts = c("6" = 3), so observations from species with fewer than 3 observations will not be included.
subsampled <- taxonomic_sample(unite_ex_data_3,
                               max_counts = c("3" = 20, "6" = 5),
                               min_counts = c("6" = 3))
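The effect of a minimum and a maximum count can be illustrated without metacoder. The following base-R sketch applies both limits to a single grouping column of a hypothetical data frame. The helper `cap_and_floor` and the toy data are assumptions made for illustration; they are not part of metacoder, which applies such limits across taxonomic ranks rather than to one column.

```r
# Hypothetical observations: one row per sequence, labelled by species.
obs <- data.frame(
  id = 1:12,
  species = rep(c("sp_a", "sp_b", "sp_c"), times = c(8, 3, 1)),
  stringsAsFactors = FALSE
)

# Illustrative helper (not a metacoder function): drop groups with fewer than
# `min_count` rows, then randomly down-sample larger groups to `max_count` rows.
cap_and_floor <- function(data, group, min_count, max_count) {
  groups <- split(data, data[[group]])
  groups <- groups[vapply(groups, nrow, integer(1)) >= min_count]
  sampled <- lapply(groups, function(g) {
    keep <- sample.int(nrow(g), size = min(nrow(g), max_count))
    g[sort(keep), , drop = FALSE]
  })
  do.call(rbind, sampled)
}

subsampled_obs <- cap_and_floor(obs, "species", min_count = 3, max_count = 5)
table(subsampled_obs$species)
```

In this toy case sp_a is capped at 5 rows, sp_b keeps all 3 of its rows, and sp_c is dropped because it has only 1.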
We can now view the difference between the original and sub-sampled data sets using heat_tree again.
heat_tree(subsampled,
          node_size = n_obs,
          node_color = n_obs,
          node_label = ifelse(n_supertaxa %in% c(6, 3), n_obs, NA),
          edge_label = ifelse(n_supertaxa == 3, name, NA))
Note how the taxonomy information for taxa with no observations is preserved. These taxa can be removed using filter_taxa:
filter_taxa(subsampled, n_obs > 0, subtaxa = FALSE) %>%
  heat_tree(node_size = n_obs,
            node_color = n_obs,
            node_label = ifelse(n_supertaxa %in% c(6, 3), n_obs, NA),
            edge_label = ifelse(n_supertaxa == 3, name, NA))
sample_n_obs and sample_n_taxa
If you want the random sample to contain a specified number of observations, or want more control over how often different observations or taxa get sampled, you can use sample_n_obs and sample_n_taxa.
sample_n_obs(unite_ex_data_3, size = 100, taxon_weight = 1 / n_obs) %>%
  heat_tree(node_size = n_obs,
            node_color = n_obs,
            node_label = n_obs,
            edge_label = name)
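A weight of 1 / n_obs makes an observation less likely to be drawn the more observations its taxon has. The idea can be sketched in base R, assuming a flat vector of hypothetical taxon labels rather than a metacoder object; how metacoder combines weights across ranks internally is not shown here.

```r
set.seed(1)
# Hypothetical taxon label for each of 1000 observations.
taxon <- rep(c("common", "intermediate", "rare"), times = c(900, 90, 10))

# Weight each observation by the inverse of its taxon's observation count.
n_obs_per_taxon <- table(taxon)
weight <- 1 / as.numeric(n_obs_per_taxon[taxon])

# Weighted sample of observation indexes, without replacement.
picked <- sample(seq_along(taxon), size = 30, prob = weight)
table(taxon[picked])
```

The weights within each taxon sum to 1, so on the first draw every taxon is equally likely to be chosen regardless of how many observations it has.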
These functions also support sampling with replacement:
sample_n_obs(unite_ex_data_3, size = 10000,
             taxon_weight = 1 / n_obs, replace = TRUE) %>%
  heat_tree(node_size = n_obs,
            node_color = n_obs,
            node_label = n_obs,
            edge_label = name)
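In general, sampling with replacement lets the requested sample size exceed the number of available observations, at the cost of drawing some observations more than once. A base-R sketch with hypothetical observation ids shows the difference; it illustrates the behavior of base R's sample, not of metacoder's internals.

```r
set.seed(1)
obs_ids <- 1:50  # hypothetical observation ids

# Without replacement, asking for more than 50 observations is an error,
# which is caught here and turned into NULL.
too_many <- tryCatch(sample(obs_ids, size = 200), error = function(e) NULL)
is.null(too_many)

# With replacement, the same request succeeds and ids are repeated.
resampled <- sample(obs_ids, size = 200, replace = TRUE)
length(resampled)
anyDuplicated(resampled) > 0
```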
There is more information on sample_n_obs and sample_n_taxa in the “Manipulating taxonomic data” tutorial.
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.2 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] metacoder_0.1.2 knitcitations_1.0.7 knitr_1.14
##
## loaded via a namespace (and not attached):
## [1] igraph_1.0.1 Rcpp_0.12.9 magrittr_1.5 munsell_0.4.3
## [5] colorspace_1.2-7 R6_2.2.0 bibtex_0.4.0 stringr_1.1.0
## [9] httr_1.2.1 plyr_1.8.4 dplyr_0.5.0 tools_3.3.1
## [13] gtable_0.2.0 DBI_0.5-1 htmltools_0.3.5 lazyeval_0.2.0
## [17] assertthat_0.1 yaml_2.1.13 rprojroot_1.2 digest_0.6.12
## [21] tibble_1.2 RJSONIO_1.3-0 ggplot2_2.2.1 reshape2_1.4.2
## [25] RefManageR_0.13.1 formatR_1.4 bitops_1.0-6 RCurl_1.95-4.8
## [29] evaluate_0.10 rmarkdown_1.3 labeling_0.3 stringi_1.1.2
## [33] scales_0.4.1 backports_1.0.5 XML_3.98-1.4 lubridate_1.6.0